Abstract. We solve an open problem related to an optimal encoding of
a straight line program (SLP), a canonical form of grammar compression
deriving a single string deterministically. We show that an
information-theoretic lower bound for representing an SLP with n symbols
requires at least 2n + log n! + o(n) bits. We then present a succinct
representation of an SLP; this representation is asymptotically equivalent
to the lower bound. The space is at most 2n log ρ(1 + o(1)) bits for
ρ ≤ 2√n, while supporting random access to any production rule of an SLP
in O(log log n) time. In addition, we present a novel dynamic data
structure associating a digram with a unique symbol. Such a data structure
is called a naming function and has been implemented using a hash table
that has a space-time tradeoff. Thus, the memory space is mainly
occupied by the hash table during the development of production rules.
Alternatively, we build a dynamic data structure for the naming function
by leveraging the idea behind the wavelet tree. The space is strictly
bounded by 2n log n(1 + o(1)) bits, while supporting O(log n) query and
update time.

1 Introduction

Grammar compression has been an active research area since at least the
seventies. The problem consists of two phases: (i) building the smallest
context-free grammar (CFG) generating an input string uniquely and
(ii) encoding an obtained CFG as compactly as possible.

Phase (i) is known to be NP-hard and cannot be approximated within a
constant factor [3T]. Therefore, many researchers have made
considerable efforts to design grammar compressions achieving better
approximation results in the last decade. Charikar et al. [B] and Rytter [21]
independently proposed the first O(log(u/g))-approximation algorithms based
on balanced grammar construction for the length u of a string and the size g
of the smallest CFG. Later, Sakamoto [31] also developed an
O(log(u/g))-approximation algorithm based on an idea called pairwise
comparison. In particular, Lehman [21] proved that LZ77 [3S] achieved the
best approximation of O(log n) under the condition of an unlimited window
size. Since the minimum addition chain problem is a special case of the
problem of finding the smallest CFG [20], modifying the approximation
algorithms proposed so far is a difficult problem. Thus, the problem
of grammar compression is pressing in phase (ii).

A straight line program (SLP) is a canonical form of a CFG, and has been
used in many grammar compression algorithms [36,19,35,1,23]. The production
rules in SLPs are in Chomsky normal form, where the right-hand side of a
production rule is a digram: a pair of symbols. Thus, if n symbols
are stored in an array called a phrase dictionary consisting of 2n
fixed-length codes, each of which is represented by log n bits, the memory
of the dictionary is 2n log n bits, which usually exceeds the memory needed
to store the input string itself. Although directly addressable codes
achieving entropy bounds on strings, whose memory consumption is the same as
that of the fixed-length codes in the worst case, have been presented
[8,30,11], there are no codes that achieve an information-theoretic lower
bound of storing an SLP in a phrase dictionary. Since a nontrivial
information-theoretic lower bound of directly addressable codes for
a phrase dictionary remains unknown, establishing the lower bound and
developing novel codes for optimally representing an SLP are challenges.

We present an optimal and directly addressable SLP within a strictly bounded
memory close to the amount of a plain representation of the phrase dictionary.
We first give an information-theoretic lower bound on the problem of encoding
an SLP, which has been unknown thus far. Let C be a class of objects.
Representing an object c ∈ C requires at least log |C| bits. A representation
of c is succinct if it requires at most log |C| (1 + o(1)) bits. Considering
the facts and the characteristics of SLPs indicated in [23], one can predict
that the lower bound for the class of SLPs with n symbols would be between
2n and 4n + log n!. By leveraging this prediction, we derive that a lower
bound of bits to represent SLPs is 2n + log n!.

We then present an almost optimal encoding of SLPs based on monotonic
subsequence decomposition of a sequence. Any permutation of [1, n] is
decomposable into at most ρ ≤ 2√n monotonic subsequences in O(n^1.5) time
[33], and there is a 1.71-approximation algorithm in O(n^3) time. While the
previous encoding method for SLPs presented in [32] is also based on the
decomposition, the size is not asymptotically equal to the lower bound when
ρ ≈ √n. We improve the data structure by using the wavelet tree (WT) [12]
and its improved results [3,10] such that our novel data structure achieves
the smaller bound of min{2n + n log n + o(n), 2n log ρ(1 + o(1))} bits for
any SLP with n symbols while supporting O(log log ρ) access time. Our method
is applicable to any type of algorithm generating SLPs, including Re-Pair
[19] and an online algorithm called LCA [21]. Barbay et al. [1] presented a
succinct representation of a sequence using the monotonic subsequence
decomposition. Their method uses the representation of an integer function
built on a succinct representation of integer ranges. Its size is estimated
to be the degree entropy of an ordered tree [14].



Note: minimizing ρ (the number of monotonic subsequences) is NP-hard.



Another contribution of this paper is a dynamic data structure
for checking whether or not a production rule in a CFG has already been
generated during execution. Such a data structure is called a naming
function, and is also necessary for practical grammar compressions. When the
set of symbols is static, we can construct a perfect hash as a naming
function in linear time, which achieves an amount of space within around a
factor of 2 from the information-theoretical minimum [5]. However, variables
of SLPs are generated step by step in grammar compression. While the
function can be dynamically computed by a randomized [17] or a deterministic
solution [16] in O(1) time and linear space, a hidden constant in the
required space was not clear. We present a dynamic data structure to compute
function values in O(log n) query time and update time. The space is
strictly bounded by 2n log n(1 + o(1)) bits.



2 Preliminaries 

2.1 Grammar compression 

For a finite set C, |C| denotes its cardinality. Alphabet Σ is a finite set
of letters and σ = |Σ| is a constant. 𝒳 is a recursively enumerable set of
variables with Σ ∩ 𝒳 = ∅. A sequence of symbols from Σ ∪ 𝒳 is called a
string. The set of all possible strings from Σ is denoted by Σ*. For a
string S, the expressions |S|, S[i], and S[i, j] denote the length of S, the
i-th symbol of S, and the substring of S from S[i] to S[j], respectively.
Let [S] be the set of symbols composing S. A string of length two is called
a digram.

A CFG is represented by G = (Σ, V, P, X_s) where V is a finite subset of 𝒳,
P is a finite subset of V × (V ∪ Σ)*, and X_s ∈ V. A member of P is called
a production rule and X_s is called the start symbol. The set of strings in
Σ* derived from X_s by G is denoted by L(G).

A CFG G is called admissible if exactly one X → α ∈ P exists for any X ∈ V
and |L(G)| = 1. An admissible G deriving S is called a grammar compression
of S.

We consider only the case |α| = 2 for any production rule X → α because
any grammar compression with n variables can be transformed into such a
restricted CFG with at most 2n variables. Moreover, this restriction is
useful for practical applications of compression algorithms, e.g., LZ78 [36],
REPAIR [19], and LCA [21], and indices, e.g., SLP [7] and ESP.

The derivation tree of G is represented by a rooted ordered binary tree such
that internal nodes are labeled by variables in V and the yield, i.e., the
sequence of labels of leaves, is equal to S. In this tree, any internal node
Z ∈ V has a left child labeled X and a right child labeled Y, which
corresponds to the production rule Z → XY.

If a CFG is obtained from any other CFG by a permutation π: Σ ∪ V → Σ ∪ V,
they are identical to each other because the string derived from one is
transformed to that from the other by the renaming. For example,
P = {Z → XY, Y → ab, X → aa} and P' = {X → YZ, Z → ab, Y → aa} are identical
to each other. On the other hand, they are clearly different from
P'' = {Z → aY, Y → bX, X → aa} because their depths are different. Thus, we
assume the following canonical form of CFG called straight line program
(SLP).

Definition 1. (Karpinski-Rytter-Shinohara [18]) An SLP is a grammar
compression over Σ ∪ V whose production rules are formed by either X_i → a
or X_k → X_i X_j, where a ∈ Σ and 1 ≤ i, j < k ≤ |V|.
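The index condition i, j < k in Definition 1 means that an SLP can be expanded bottom-up in a single pass over the variables. The following is a minimal sketch, assuming a hypothetical Python dictionary `rules` that maps each index k either to a terminal (for X_k → a) or to a pair (i, j) (for X_k → X_i X_j), and assuming that the variable with the largest index is the start symbol. The rule set is an illustrative numbering of the example P above, not one taken from the paper.

```python
def expand_slp(rules):
    """Expand an SLP given as {k: 'a'} or {k: (i, j)} with i, j < k."""
    text = {}
    for k in sorted(rules):          # i, j < k, so smaller indices are ready first
        rhs = rules[k]
        if isinstance(rhs, str):     # X_k -> a
            text[k] = rhs
        else:                        # X_k -> X_i X_j
            i, j = rhs
            assert i < k and j < k, "SLP index condition violated"
            text[k] = text[i] + text[j]
    return text[max(rules)]          # assumption: the last variable is the start symbol


# Hypothetical numbering: X1 -> a, X2 -> b, X3 -> X1 X1, X4 -> X1 X2, X5 -> X3 X4
RULES = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4)}
print(expand_slp(RULES))
```

The sketch materializes every intermediate string, so it is only suitable for small examples; the derived string can be exponentially longer than the SLP.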

2.2 Phrase/reverse dictionary 

For a set P of production rules, a phrase dictionary D is a data structure
for directly accessing the phrase X_i X_j for any X_k ∈ V if
X_k → X_i X_j ∈ P. Regarding a triple (k, i, j) of positive integers as
X_k → X_i X_j, we can store the phrase dictionary consisting of n variables
in an integer array D[1, 2n], where D[2k − 1] = D[2k] = 0 if k belongs to
an alphabet, i.e., 1 ≤ k ≤ |Σ|. X_i and X_j are accessible as D[2k − 1] and
D[2k] by indices 2k − 1 and 2k for X_k, respectively. A plain
representation of D using fixed-length codes requires 2n log n bits of space
to store n production rules.

Reverse dictionary D^{-1} is a data structure for directly accessing the
variable X_k given X_i X_j for a production rule X_k → X_i X_j ∈ P. Thus,
D^{-1}(X_i X_j) returns X_k if X_k → X_i X_j ∈ P. A hash table is a
representative data structure for D^{-1}, enabling O(1) time access and
achieving O(n log n) bits of space.

2.3 Rank/select dictionary 

We present a phrase dictionary based on the rank/select dictionary, a data
structure for a bit string B [13] supporting the following queries:
rank_c(B, i) returns the number of occurrences of c ∈ {0, 1} in B[1, i] and
select_c(B, i) returns the position of the i-th occurrence of c ∈ {0, 1} in
B. For example, if B = 10110100111 is given, then rank_1(B, 7) = 4 because
the number of 1s in B[1, 7] is 4, and select_1(B, 5) = 9 because the
position of the fifth 1 in B is 9. Although naive approaches require
O(|B|) time to compute a rank, several data structures using only
|B| + o(|B|) bits of storage and achieving O(1) time [26,27] have been
presented. Most methods compute a select query by a binary search on a bit
string B in O(log |B|) time. A data structure for computing the select query
in O(1) time has also been presented [28].
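The two queries can be checked against the example bit string with a naive, linear-time sketch. The function names `rank` and `select` below are illustrative; the code reproduces only the semantics of the queries, not the succinct constant-time structures cited above.

```python
def rank(B, c, i):
    """Number of occurrences of bit c in B[1..i] (1-based, inclusive)."""
    return sum(1 for x in B[:i] if x == c)


def select(B, c, j):
    """1-based position of the j-th occurrence of bit c in B."""
    seen = 0
    for pos, x in enumerate(B, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("B contains fewer than j occurrences of c")


B = [int(ch) for ch in "10110100111"]
print(rank(B, 1, 7), select(B, 1, 5))
```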

2.4 Wavelet tree 

A WT is a data structure for a string S ∈ Σ*, and it can be used to compute
the rank and select queries on a string S over an ordinal alphabet in
O(log σ) time and n log σ(1 + o(1)) bits [12]. Data structures supporting
the rank and select queries in O(log log σ) time with the same space have
been proposed [10,3]. WT also supports access(S, i), which returns S[i] in
O(log σ) time. Recently, WT has been extended to support various operations
on strings.

A WT for a sequence S over Σ = {1, ..., σ} is a binary tree that can be,
recursively, presented over a sub-alphabet range [a, b] ⊆ [1, σ]. Let S_v be
a sequence represented in a node v, and let left(v) and right(v) be left and
right children of node v, respectively. The root v_root represents
S_root = S over the alphabet range [1, σ]. At each node v, S_v is split into
two subsequences: S_left(v), consisting of the sub-alphabet range
[a, ⌊(a + b)/2⌋] for left(v), and S_right(v), consisting of the sub-alphabet
range [⌊(a + b)/2⌋ + 1, b] for right(v), where S_left(v) and S_right(v) keep
the order of elements in S_v. The splitting process repeats until a = b.
Each node v in the binary tree contains a rank/select dictionary on a bit
string B_v. Bit B_v[k] indicates whether S_v[k] should be moved to left(v)
or right(v). If B_v[k] = 0, S_left(v) contains S_v[k]. If B_v[k] = 1,
S_right(v) inherits S_v[k]. Formally, B_v[k] with an alphabet range [a, b]
is defined as:

\[
B_v[k] =
\begin{cases}
1 & \text{if } S_v[k] > \lfloor (a+b)/2 \rfloor, \\
0 & \text{if } S_v[k] \le \lfloor (a+b)/2 \rfloor.
\end{cases}
\]

Fig. 1. Example of wavelet tree for a sequence S = 342112243 over an
alphabet {1, 2, 3, 4}.



An example of a WT is shown in Figure 1. In this example, S_root[2] = 4
belongs to the higher half [3, 4] of the alphabet range [1, 4] represented
in the root; therefore, it is the second element of S_root that must go to
the right child of the root, i.e., B_root[2] = 1 and
S_right(root)[2] = S_root[2] = 4.
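The recursive splitting and the access and rank queries can be illustrated on the sequence of Figure 1. The class below is a minimal pointer-based sketch, assuming plain Python lists in place of rank/select dictionaries; the bit-level ranks are therefore linear scans, so it shows the traversal logic rather than the stated time and space bounds. The class and method names are illustrative.

```python
class WaveletTree:
    """Pointer-based wavelet tree over the alphabet range [a, b]."""

    def __init__(self, seq, a, b):
        self.a, self.b = a, b
        self.bits = []
        self.left = self.right = None
        if a == b or not seq:                 # leaf, or nothing left to split
            return
        mid = (a + b) // 2
        self.bits = [0 if x <= mid else 1 for x in seq]          # B_v
        self.left = WaveletTree([x for x in seq if x <= mid], a, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, b)

    def _rank_bit(self, bit, i):
        return sum(1 for x in self.bits[:i] if x == bit)

    def access(self, i):
        """Return S_v[i] (1-based) by following bits down to a leaf."""
        if self.a == self.b:
            return self.a
        bit = self.bits[i - 1]
        child = self.right if bit else self.left
        return child.access(self._rank_bit(bit, i))

    def rank(self, c, i):
        """Number of occurrences of symbol c in S_v[1..i]."""
        if i == 0:
            return 0
        if self.a == self.b:
            return i
        bit = 0 if c <= (self.a + self.b) // 2 else 1
        child = self.right if bit else self.left
        return child.rank(c, self._rank_bit(bit, i))


S = [3, 4, 2, 1, 1, 2, 2, 4, 3]
wt = WaveletTree(S, 1, 4)
print(wt.bits)          # root bit string B_root
print(wt.access(2))     # S[2]
print(wt.rank(2, 7))    # occurrences of 2 in S[1..7]
```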

3 Succinct SLP 



3.1 Information-theoretic lower bound 

In this section, we present a tight lower bound to represent SLPs having a
set of production rules P consisting of n = |Σ ∪ V| symbols. Each production
rule Z → XY ∈ P is considered as two directed edges (Z, X) and (Z, Y), so
the SLP can be seen as a directed acyclic graph (DAG) with a single source
and |Σ| sinks. Here, we consider (Z, X) as the left edge and (Z, Y) as the
right edge. In addition, P can be considered as a DAG with the single source
and with a single sink by introducing a super-sink s and drawing directed
left and right edges from any sink to s (Figure 2). Let DAG(n) be the set of
all possible Gs with n nodes and DAG = ∪_{n→∞} DAG(n). Since two SLPs are
identical if an SLP can be converted to the other SLP by a permutation
π: Σ ∪ V → Σ ∪ V, the number of different SLPs is |DAG(n)|. Any internal
node of G ∈ DAG(n) has exactly two (left/right) edges. Thus, the following
fact, remarked in prior work, is true.

Fig. 2. Example of DAG representation of an SLP and its spanning tree
decomposition. An SLP is represented by a DAG G. G is decomposed into the
left tree T_L and right tree T_R.
Fact 1. An in-branching spanning tree is an ordered tree such that the
out-degree of any node except the root is exactly one. For any in-branching
spanning tree of G, the graph consisting of the remaining edges and their
adjacent vertices is also an in-branching spanning tree of G.

The in-branching spanning tree consisting of the left edges (respectively
the right edges) and their adjacent vertices is called the left tree T_L
(respectively right tree T_R) of G. Note that the source in G is a leaf of
both T_L and T_R, and the super-sink of G is the root of both T_L and T_R.
We shall call the operation of decomposing a DAG G into two spanning trees
T_L and T_R spanning tree decomposition. In Figure 2, the source x_5 in G is
a leaf of both T_L and T_R, and the super-sink s in G is the root of both
T_L and T_R.
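Because every node has exactly one left edge and one right edge, the two trees can be stored as parent arrays, where the parent of a node is the target of its left (respectively right) edge. The following is a minimal sketch, assuming a hypothetical `rules` mapping as input and using 0 as the identifier of the super-sink; neither convention comes from the paper.

```python
SUPER_SINK = 0  # assumed identifier for the super-sink s


def spanning_tree_decomposition(rules):
    """Split the DAG of an SLP into parent maps for T_L and T_R."""
    left_parent, right_parent = {}, {}
    for k, rhs in rules.items():
        if isinstance(rhs, tuple):            # X_k -> X_i X_j: edges (k, i) and (k, j)
            left_parent[k], right_parent[k] = rhs
        else:                                 # a sink: both edges go to the super-sink
            left_parent[k] = right_parent[k] = SUPER_SINK
    return left_parent, right_parent


def depth(parent, v):
    """Number of edges on the path from v to the root (the super-sink)."""
    d = 0
    while v != SUPER_SINK:
        v = parent[v]
        d += 1
    return d


# Hypothetical SLP: X1 -> a, X2 -> b, X3 -> X1 X1, X4 -> X1 X2, X5 -> X3 X4
RULES = {1: "a", 2: "b", 3: (1, 1), 4: (1, 2), 5: (3, 4)}
T_L, T_R = spanning_tree_decomposition(RULES)
print(T_L, T_R)
```

In this example the source X5 never appears as a parent in either map, which matches the remark that the source is a leaf of both trees.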

Any ordered tree is an element of 𝒯 = ∪_{n→∞} 𝒯_n, where 𝒯_n is the set of
all possible ordered trees with n nodes. As shown in [2,34], there exists an
enumeration tree for 𝒯 such that any T ∈ 𝒯 appears exactly once. The
enumeration tree is defined by the rightmost expansion, i.e., in this
enumeration tree, a node T' ∈ 𝒯_{n+1}, which is a child of T ∈ 𝒯_n, is
obtained by adding a rightmost node to T. In our problem, an ordered tree
T ∈ 𝒯_{n+1} is identical to a left tree T_L with n + 1 nodes for
n = |Σ ∪ V| symbols.

Let G ⊕ (u, v) be the DAG obtained by adding the edge (u, v) to a DAG G.
If necessary, we write G ⊕ (u, v)_L to indicate that (u, v) is added as a
left edge, and G ⊕ (u, v)_R for a right edge. For a set E of edges, the DAG
G ⊕ E is defined analogously, i.e., by adding all the edges (u, v) ∈ E to G.
The DAG G ⊖ E is defined by deleting all the edges (u, v) ∈ E from G.

Theorem 1. The information-theoretic lower bound on the minimum number
of bits needed to represent an SLP with n symbols is 2n + log n! + o(n).

Proof. Let S(n) be the set of all possible DAGs with n nodes and a single
source/sink such that any internal node has exactly two children. This S(n)
is a superset of DAG(n) because the in-degree of the sink of any DAG in
DAG(n) must be exactly 2σ, whereas S(n) does not have such a restriction.
By the definition, |S(n)|/n^σ ≤ |DAG(n)| ≤ |S(n)| holds.

Let S(n, T) = {G ∈ S(n) | G = T ⊕ T_R, T_R ∈ 𝒯_n}. We show
|S(n, T)| = (n − 1)! for each T ∈ 𝒯_n by induction on n ≥ 1. Since the base
case n = 1 is clear, we assume that the induction hypothesis is true for
some n ≥ 1.

Let T'_L be the rightmost expansion of T_L such that the rightmost node u is
added as the rightmost child of node v in T_L, and let G' ∈ S(n + 1, T'_L)
with a left tree T'_L. By the induction hypothesis, the number of
G ∈ S(n, T_L) is (n − 1)! and T_L is embedded into G as the left tree. Then,
G' is constructed by adding the left edge (u, v) and a right edge (u, x) for
a node x in T_L.

Let s be the source of G. For v = s, each
G' = G ⊕ (u, v)_L ⊕ (u, x)_R ∈ S(n + 1, T'_L) is admissible, and the number
of them is clearly n|S(n, T_L)| = n!. For v ≠ s and x = s,
G' = G ⊕ (u, v)_L ⊕ (u, x)_R ∈ S(n + 1, T'_L) is admissible. Otherwise,
there exists the lowest common ancestor y of s and x on T_R with
G = T_L ⊕ T_R. Then,
G' = G ⊕ (u, v)_L ⊕ (u, x)_R ⊖ (y', y)_R ⊕ (y', u)_R is an admissible DAG in
S(n + 1, T'_L), where y' is the unique child of y in the path from y to s in
T_R. In this case, the number of such G's is also n! because no edge is
changed in T_L and the pair (T'_L, T'_R) containing the edge (u, x)_R is
unique for any fixed T'_L. Thus, |S(n + 1, T)| = n! is true for each
T ∈ 𝒯_{n+1}.



This result derives |S(n)| = C_n (n − 1)!, where

\[
C_n = \frac{1}{n+1}\binom{2n}{n} \simeq \frac{2^{2n}}{n^{3/2}}
\]

is the number of ordered trees with n + 1 nodes. Combining this with
|S(n)|/n^σ ≤ |DAG(n)| ≤ |S(n)| as well, we get the result that the
information-theoretic minimum bits needed to represent G ∈ DAG(n) is at
least 2n + log n! + o(n). □
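The counting argument can be checked numerically: log_2 of C_n (n − 1)! should differ from 2n + log_2 n! by only a lower-order term, and both should be well below the 2n log n bits of the plain array. The following is a small sketch using `math.lgamma` to evaluate the logarithms of factorials; the function names are illustrative, and the plain size assumes ⌈log_2 n⌉ bits per code.

```python
import math

LN2 = math.log(2)


def log2_factorial(n):
    """log2(n!) via the log-gamma function."""
    return math.lgamma(n + 1) / LN2


def log2_catalan(n):
    """log2 of C_n = binom(2n, n) / (n + 1)."""
    return (math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1)) / LN2 - math.log2(n + 1)


def log2_count(n):
    """log2 of C_n * (n - 1)!, the size of S(n) derived in the proof."""
    return log2_catalan(n) + log2_factorial(n - 1)


def lower_bound_bits(n):
    """Leading terms 2n + log2(n!) of the bound in Theorem 1."""
    return 2 * n + log2_factorial(n)


def plain_bits(n):
    """Plain array D[1, 2n] with ceil(log2 n)-bit fixed-length codes."""
    return 2 * n * math.ceil(math.log2(n))


n = 1 << 20
print(round(lower_bound_bits(n) / n, 2), "bits per symbol (lower bound)")
print(plain_bits(n) // n, "bits per symbol (plain array)")
print(round(lower_bound_bits(n) - log2_count(n), 1), "bits of lower-order gap")
```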

3.2 An optimal SLP representation 

We present an optimal representation of an SLP as an improvement of the data
structure recently presented in [32]. We apply the spanning tree
decomposition to the DAG G of a given SLP, and obtain the DAG
T_L ⊕ T_R (= G). We rename the variables in T_L by breadth-first order and
also rename variables in T_R according to T_L. Let G' be the resulting DAG
from G. Then, for the array representation D[1, 2n] of G', we obtain the
condition D[1] ≤ D[3] ≤ ... ≤ D[2n − 1]. Since this monotonic sequence is
encoded by 2n + o(n) bits, D is represented by 2n + n log n + o(n) bits
supporting access(D, k) (1 ≤ k ≤ 2n) in O(1) time. We focus on the remaining
sequence of length n, i.e., D[2], D[4], ..., D[2n]. For simplicity, we write
D instead of [D[2], D[4], ..., D[2n]].
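One standard way to store a non-decreasing sequence of n values bounded by n in about 2n bits is to write each gap in unary, followed by a 1. The i-th value is then the number of 0s preceding the i-th 1. The sketch below assumes this unary-gap encoding (the same form as the bit string B used later in this section) and uses a linear scan in place of a constant-time select structure.

```python
def encode_monotone(values):
    """Encode v_1 <= v_2 <= ... as 0^{v_1} 1 0^{v_2 - v_1} 1 ..."""
    bits, prev = [], 0
    for v in values:
        assert v >= prev, "sequence must be non-decreasing"
        bits.extend([0] * (v - prev))   # unary gap
        bits.append(1)                  # marks one element
        prev = v
    return bits


def access(bits, i):
    """Return the i-th value (1-based): select_1(bits, i) - i zeros precede it."""
    ones = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            ones += 1
            if ones == i:
                return pos - i
    raise IndexError("fewer than i elements encoded")


values = [0, 0, 1, 3, 3]
bits = encode_monotone(values)
print(bits, [access(bits, i) for i in range(1, len(values) + 1)])
```

The bit string has one 1 per element and one 0 per unit of the largest value, so its length is at most 2n when the values are bounded by n.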

Let S = {s_1, ..., s_ρ} be a disjoint set of subsequences of [1, n] such
that any i ∈ {1, 2, ..., n} is contained in some s_k and any s_i, s_j
(i ≠ j) are disjoint. Such an S is called a decomposition of D. A sequence
D[s_{k_1}], ..., D[s_{k_p}] is weakly monotonic if it is increasing, i.e.,
D[s_{k_1}] ≤ ... ≤ D[s_{k_p}], or decreasing, i.e.,
D[s_{k_1}] ≥ ... ≥ D[s_{k_p}]. In addition, S is called monotonic if the
sequence D[s_{k_1}], ..., D[s_{k_p}] is weakly monotonic for any
s_k = [s_{k_1}, ..., s_{k_p}] ∈ S.
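To make the definition concrete, the sketch below computes a monotonic decomposition with a simple greedy heuristic: it repeatedly removes a longest weakly increasing or weakly decreasing subsequence, whichever is longer. This heuristic is an assumption made for illustration; it is not the algorithm of [33], and no claim is made here about how close its number of subsequences is to the bounds discussed in the text. Indices are 0-based in the code.

```python
def longest_monotone(idx, D, increasing):
    """Indices of a longest weakly monotone subsequence of D restricted to idx (O(m^2) DP)."""
    m = len(idx)
    best, prev = [1] * m, [-1] * m
    for j in range(m):
        for i in range(j):
            ok = D[idx[i]] <= D[idx[j]] if increasing else D[idx[i]] >= D[idx[j]]
            if ok and best[i] + 1 > best[j]:
                best[j], prev[j] = best[i] + 1, i
    j = max(range(m), key=best.__getitem__)
    out = []
    while j != -1:                       # walk the predecessor chain backwards
        out.append(idx[j])
        j = prev[j]
    return out[::-1]


def greedy_decomposition(D):
    """Return a list of (indices, flag); flag 0 = increasing, 1 = decreasing."""
    remaining = list(range(len(D)))
    parts = []
    while remaining:
        inc = longest_monotone(remaining, D, True)
        dec = longest_monotone(remaining, D, False)
        chosen, flag = (inc, 0) if len(inc) >= len(dec) else (dec, 1)
        parts.append((chosen, flag))
        taken = set(chosen)
        remaining = [i for i in remaining if i not in taken]
    return parts


D = [0, 1, 1, 0, 4]
parts = greedy_decomposition(D)
print(parts)
```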

Theorem 2. Any SLP with n symbols can be represented using
2n log ρ(1 + o(1)) bits for ρ ≤ 2√n, while supporting O(log log ρ) access
time.
Fig. 3. Encoded phrase dictionary: D indicates the remaining sequence
D[2], D[4], ..., D[2n]. D is encoded by (D_ρ, D_π, B, b) based on a
monotonic decomposition S of D, i.e., each s ∈ S indicates a weakly
monotonic subsequence in D; D_ρ is the sequence of i indicating the
membership for some s_i ∈ S, D_π is a permutation of D_ρ with respect to the
corresponding value in D, and B is a binary encoding of the sorted D in
increasing order. We show only the case that D[i] is a member of an
increasing s ∈ S, but the other case is similarly computed by b. In the
depicted example, D[5] = 4 is obtained by D_ρ[5] = 1, b[1] = 0,
rank_1(D_ρ, 5) = 3, select_1(D_π, 3) = 5, select_1(B, 5) = 9, and
rank_0(B, 9) = 4.



Proof. It is sufficient to prove that any D of length n can be represented
using 2n log ρ + o(n) bits for some ρ ≤ 2√n. By the result in [33], we can
construct a monotonic decomposition S of D such that ρ = |S| ≤ 2√n.

We represent the sequence D as a four-tuple (D_ρ, D_π, B, b) using S. For
each 1 ≤ p ≤ n, D_ρ[p] = k iff p is a member of s_k ∈ S for some
1 ≤ k ≤ ρ. Let (D[1], D_ρ[1]), ..., (D[n], D_ρ[n]) be the sequence of pairs
(D[p], D_ρ[p]) (1 ≤ p ≤ n). We sort these pairs with respect to the keys
D[p] (1 ≤ p ≤ n) and obtain the sorted sequence
(D[ℓ_1], D_ρ[ℓ_1]), ..., (D[ℓ_n], D_ρ[ℓ_n]). We define D_π as the
permutation D_ρ[ℓ_1] ··· D_ρ[ℓ_n].

B ∈ {0, 1}* is defined as the bit string

\[
B = 0^{D[\ell_1]}\, 1\, 0^{D[\ell_2] - D[\ell_1]}\, 1 \cdots 0^{D[\ell_n] - D[\ell_{n-1}]}\, 1.
\]

Finally, b[k] = 0 if s_k ∈ S is increasing and b[k] = 1 otherwise for
1 ≤ k ≤ ρ. D_ρ and D_π are represented by WTs, respectively, and B is a
rank/select dictionary.

We recover D[p] using (D_ρ, D_π, B, b). When D_ρ[p] = k and b[k] = 0, i.e.,
D[p] is included in the k-th monotonic subsequence s_k ∈ S that is
increasing, we obtain

\[
D[p] = \mathrm{rank}_0(B, \mathrm{select}_1(B, \ell))
\]

by ℓ = select_k(D_π, rank_k(D_ρ, p)). When D_ρ[p] = k and b[k] = 1, we can
similarly obtain D[p] by replacing ℓ with
r = select_k(D_π, rank_k(D_ρ, n) + 1 − rank_k(D_ρ, p)).

The total size of the data structure formed by (D_ρ, D_π, B, b) is at most
2n log ρ(1 + o(1)) bits. The rank/select/access operations of the WT for a
static sequence over ρ ≤ 2√n symbols can be improved to achieve
O(log log ρ) time for each query [3,10]. □

In Figure 3, for the sequence (0, 1), (1, 2), (1, 1), (0, 2), (4, 1) of
pairs (D[p], D_ρ[p]) (1 ≤ p ≤ 5), the sorted sequence is
(0, 1), (0, 2), (1, 2), (1, 1), (4, 1). Thus, D_π is 12211.
B = 0^0 1 0^(0−0) 1 0^(1−0) 1 0^(1−1) 1 0^(4−1) 1 = 110110001. b[1] = 0
because s_1 is increasing, and b[2] = 1 because s_2 is decreasing.
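The encoding and the recovery of D[p] can be traced on the values of this example. The following is a minimal sketch, assuming plain Python lists with linear-time rank and select in place of the WTs and the rank/select dictionary, so it illustrates the logic rather than the stated time bounds. The tie-breaking rule in the sort is an assumption chosen so that equal values inside one subsequence keep a consistent order; it yields a D_π that differs from the one in the figure among equal keys, while B and the decoded values are unchanged.

```python
def rank(seq, c, i):
    """Occurrences of c in seq[1..i] (1-based, inclusive)."""
    return sum(1 for x in seq[:i] if x == c)


def select(seq, c, j):
    """1-based position of the j-th occurrence of c in seq."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences")


def encode(D, D_rho, b):
    """Build (D_pi, B) from D, the membership sequence D_rho and the direction flags b."""
    def key(p):
        sign = 1 if b[D_rho[p - 1] - 1] == 0 else -1   # reverse ties in decreasing runs
        return (D[p - 1], sign * p)
    order = sorted(range(1, len(D) + 1), key=key)       # positions l_1, ..., l_n
    D_pi = [D_rho[p - 1] for p in order]
    B, prev = [], 0
    for p in order:                                     # unary gaps of the sorted values
        B.extend([0] * (D[p - 1] - prev))
        B.append(1)
        prev = D[p - 1]
    return D_pi, B


def access(p, D_rho, D_pi, B, b):
    """Recover D[p] from the four-tuple."""
    k = D_rho[p - 1]
    r = rank(D_rho, k, p)
    if b[k - 1] == 1:                                   # decreasing: count from the end
        r = rank(D_rho, k, len(D_rho)) + 1 - r
    pos = select(D_pi, k, r)
    return rank(B, 0, select(B, 1, pos))


D = [0, 1, 1, 0, 4]
D_rho = [1, 2, 1, 2, 1]
b = [0, 1]
D_pi, B = encode(D, D_rho, b)
print(D_pi, B, [access(p, D_rho, D_pi, B, b) for p in range(1, 6)])
```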

4 Data Structure for Reverse Dictionary 

In this section, we present a data structure for simulating the naming
function H defined as follows. For a phrase dictionary D with n symbols,

\[
H(X_i X_j) =
\begin{cases}
D^{-1}(X_i X_j), & \text{if } D[k] = X_i X_j \text{ for some } 1 \le k \le n, \\
X_{n+1}, & \text{otherwise.}
\end{cases}
\]



For a sufficiently large V, we set a total order on
(Σ ∪ V)^2 = {XY | X, Y ∈ Σ ∪ V}, i.e., the lexicographical order of the n^2
digrams. This order is represented by the range [1, n^2]. Then, we
recursively define WT T_D for a phrase dictionary D partitioning [1, n^2].
On the root node, the initial range [1, n^2] is partitioned into two parts:
a left range L = [1, ⌊(1 + n^2)/2⌋] and a right range
R = [⌊(1 + n^2)/2⌋ + 1, n^2]. The root is the bit string B such that
B[i] = 0 if D[i] ∈ L and B[i] = 1 if D[i] ∈ R. By this, the sequence of
digrams, D, is decomposed into two subsequences D_L and D_R; they are
projected on the roots of the left and right subtrees, respectively. Each
sub-range is recursively partitioned and the subsequence of D on a node is
further decomposed with respect to the partitioning on the node. This
process is repeated until the length of any sub-range is one. Let B_i be
the bit string assigned to the i-th node of T_D in the breadth-first
traversal. In Figure 4, we show an example of such a data structure for a
phrase dictionary D.

Theorem 3. The naming function for phrase dictionary D over n = |Σ ∪ V|
symbols can be computed by the proposed data structure T_D in O(log n) time
for any digram. Moreover, when a digram does not exist in the current D,
T_D can be updated in the same time and the space is at most
2n log n(1 + o(1)) bits.

Proof. T_D is regarded as a WT for a string S of length n such that any
symbol is represented in 2 log n bits. Thus, H(XY) is obtained by
select_XY(S, 1). The query time is bounded by the number of rank and select
operations for bit strings performed until the operation flow returns to the
root. Since the total range is [1, n^2], i.e., the height of T_D is at most
2 log n, the query time and the size are derived. When XY does not exist in
D, let i_1, i_2, ..., i_k be the sequence of traversed nodes from the root
i_1 to a leaf i_k and let B_{i_j} be the bit string on i_j. Given an
access/rank/select dictionary for B_{i_j}, we can update it to one for
B_{i_j}b with b ∈ {0, 1} in O(1) time. Therefore, the update time of T_D
for any digram is O(k) = O(log n). □
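The query and update in the proof can be sketched with a pointer-based tree over a fixed digram universe. This is an illustration under several assumptions: digrams are mapped to integer codes in [0, universe), nodes are created lazily, plain Python lists with linear-time select stand in for dynamic rank/select dictionaries, and the returned name is the 1-based position of the digram in insertion order. The class and method names are hypothetical.

```python
class Node:
    def __init__(self):
        self.bits = []               # stands in for a dynamic rank/select bit vector
        self.child = [None, None]


def select(bits, b, j):
    """1-based position of the j-th occurrence of bit b."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == b:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences")


class NamingFunction:
    def __init__(self, universe):
        assert universe >= 2
        self.universe = universe     # digram codes lie in [0, universe)
        self.root = Node()
        self.size = 0                # number of digrams named so far

    def _path_bits(self, code):
        """Branching bits obtained by repeatedly halving the code range."""
        lo, hi, out = 0, self.universe - 1, []
        while lo < hi:
            mid = (lo + hi) // 2
            if code <= mid:
                out.append(0)
                hi = mid
            else:
                out.append(1)
                lo = mid + 1
        return out

    def lookup(self, code):
        """Return the name of an existing digram, or None (select_code(S, 1))."""
        node, path = self.root, []
        for bit in self._path_bits(code):
            if node.child[bit] is None:
                return None
            path.append((node, bit))
            node = node.child[bit]
        pos = 1                      # a leaf holds exactly one occurrence
        for parent, bit in reversed(path):   # upward traversal by select
            pos = select(parent.bits, bit, pos)
        return pos

    def name(self, code):
        """Return the existing name, or append the digram and return a fresh one."""
        found = self.lookup(code)
        if found is not None:
            return found
        node = self.root
        for bit in self._path_bits(code):    # append one bit per node on the path
            node.bits.append(bit)
            if node.child[bit] is None:
                node.child[bit] = Node()
            node = node.child[bit]
        self.size += 1
        return self.size


nf = NamingFunction(36)              # e.g. 6 symbols, digram (x, y) coded as 6 * x + y
print(nf.name(6 * 1 + 2), nf.name(6 * 0 + 3), nf.name(6 * 1 + 2), nf.name(35))
```

Each update appends a single bit to every node on one root-to-leaf path, which mirrors the B_{i_j}b update in the proof; the height of the tree is logarithmic in the size of the universe.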



Fig. 4. WT for reverse dictionary: The bit string B_i is assigned to the
i-th node in breadth-first order. For each internal node i, we can move to
the left child by rank_0 and to the right child by rank_1 on B_i. The upward
traversal is simulated by select_0 and select_1 as shown. The leaf for an
existing digram is represented by 1 and null is represented by 0, whereas
these bits are omitted in this figure. The depicted example is defined over
the digram universe U = (Σ ∪ V)^2 = {00, 01, 02, ...}.



5 Discussion 



We have investigated three problems related to the construction of an SLP: the 
information-theoretic lower bound for representing the phrase dictionary D, an 
optimal representation of a directly addressable D, and a dynamic data structure 
for D~^. Here, we consider the results of this study from the viewpoint of open 
questions. 

For the first problem, we approximately estimated the size of a set of SLPs
with n symbols, which is almost equal to the exact set. This problem,
however, has several variants, e.g., the set of SLPs with n symbols deriving
the same string, which is quite difficult to estimate owing to the
NP-hardness of the smallest CFG problem. There is another variant obtained
by a restriction: any two different variables do not derive the same digram,
i.e., Z → XY and Z' → XY do not exist simultaneously for Z ≠ Z'. Although
such variables are not prohibited in the definition of SLP, they should be
removed for space efficiency. On the other hand, even if we assume this
restriction, the information-theoretic lower bound is never smaller than
log n! bits because, given a directed chain of length n as T_L, we can
easily construct (n − 1)! admissible DAGs.

For the second problem, we proposed an almost optimal encoding of SLPs. From
the standpoint of massive data compression, one drawback of the proposed
encoding is that the whole phrase dictionary must be stored in memory
beforehand. Since symbols must be sorted, we need a dynamic data structure
to allow the insertion of symbols in an array, e.g., [15]. Such data
structures, however, require O(n log n) bits of space.

For the last problem, the query time and update time of the proposed data
structure are both O(log n). This cost is considerable and it is difficult
to improve it to O(log log n) because D is not static. When focusing on the
characteristics of SLPs, we can improve the query time probabilistically:
since any symbol X appears in D at least once and |D| = 2n, the average
frequency of X is at most two. Thus, using an additional array of size
n log n bits, we can check H(XY) in O(1) time with probability at least 1/2.
However, improving this probability is not easy. For this problem, achieving
O(1) amortized query time is also an interesting challenge.